Variant Discovery    ◾    129

cd ..

#7- Marking/removing duplicate alignments

mkdir dupRemBam

cd sortedbam

for i in $(ls *.bam);

do

samtools rmdup ${i} ../dupRemBam/${i} 2> ../dupRemBam/${i}.log

done;

cd ..

#8- Alignment pileup and variant calling using bcftools

mkdir variants

cd dupRemBam

for i in $(ls *.bam);

do

samtools index ${i}

done;

ls *.bam | rev | rev > bam_list.txt

freebayes \

-f ../ref/GCF_009858895.2_ASM985889v3_genomic.fna \

-C 5 \

-L bam_list.txt \

-v ../variants/sarscov2.vcf

cd ..

#9- Compress and index the VCF file

bgzip variants/sarscov2.vcf

tabix variants/sarscov2.vcf.gz

#to view contents

bcftools view variants/sarscov2.vcf.gz | less -S

We may notice that there is a little difference between the VCF file generated by bcftools

and the one generated by FreeBayes. The reason is that each program uses a different algo-

rithm for variant calling as discussed above.

4.2.2.2  GATK Variant Calling Pipeline

Genome Analysis Toolkit (GATK) [8] is a collection of command-line tools for variant dis-

covery and analysis of the sequence data acquired from the high-throughput sequencing

instruments. For installing the latest release of GATK, follow the installation instructions

available at the GATK website “https://gatk.broadinstitute.org/”. GATK uses the haplo-

type approach as FreeBayes. However, it uses advanced workflow for variant calling. The

GATK tools can be used individually or together as a complete pipeline for variant discov-

ery. GATK also provides complete workflows called GATK Best Practices for variant call-

ing tailored for specific cases. In the following, we will discuss the most commonly used

variant calling pipeline for GATK best practice. The following workflow assumes that the

acquired raw data in FASTQ format and that the quality control has been performed on the

data. The general GATK workflow consists of the following major steps: